Skip to content

fix: improve driver-side timeout logging with pool/channel diagnostics (4.x)#883

Merged
dkropachev merged 1 commit into
scylladb:scylla-4.xfrom
nikagra:mhradovich/4.x-driver-540-timeout-diagnostics
May 13, 2026
Merged

fix: improve driver-side timeout logging with pool/channel diagnostics (4.x)#883
dkropachev merged 1 commit into
scylladb:scylla-4.xfrom
nikagra:mhradovich/4.x-driver-540-timeout-diagnostics

Conversation

@nikagra
Copy link
Copy Markdown

@nikagra nikagra commented May 11, 2026

Fixes: DRIVER-540

4.x port of #867 (which targeted scylla-3.x).

Supersedes #874 (closed — AI had committed directly onto scylla-4.x instead of a feature branch).

Problem

When a DriverTimeoutException fires, the existing log output gives no information about why it happened. There are two distinct failure modes that look identical:

  1. Pool contention — requests queue inside the driver waiting to acquire a stream ID; the cumulative wait pushes elapsed time past the configured timeout. The request may never have reached the server.
  2. Server-side slowness — requests were sent but the server took too long to respond (e.g. reader_concurrency_semaphore stall). The driver gave up, but the server may still be processing them, orphaning stream IDs.

Without knowing which nodes had in-flight requests and what the pool state looked like at timeout time, diagnosing which scenario occurred requires guesswork.

Changes

DriverTimeoutException (public API)

A new NodeDiagnostics public inner class is added with fields captured at timeout time:

Field Meaning
channelInFlight Requests currently awaiting a response on this specific connection
poolInFlight Total in-flight across all connections to this host
poolAvailableIds Remaining stream IDs available to send new requests — low = pool contention
poolOrphanedIds Stream IDs from previously timed-out/cancelled requests that can't be released yet — high = stale stream ID accumulation

Pool fields are set to UNAVAILABLE (-1) when the pool was removed between dispatch and timeout.

A new constructor DriverTimeoutException(String baseMessage, List<NodeDiagnostics>) generates the full message internally (per reviewer request). The existing single-arg constructor is unchanged.

getNodeDiagnostics() allows callers to inspect fields programmatically, consistent with the 3.x OperationTimedOutException approach.

Message format

Query timed out after PT0.5S — nodes in flight: /10.0.0.1:9042 [channel in-flight: 5, pool in-flight: 12, pool available ids: 988, pool orphaned ids: 2], /10.0.0.2:9042 [channel in-flight: 3, pool in-flight: 8, pool available ids: 992, pool orphaned ids: 0]

Diagnosing failure modes:

  • poolAvailableIds near zero → pool contention; requests queuing inside the driver before reaching the server
  • poolAvailableIds normal + high channelInFlight → server is slow; requests sent but not answered within the timeout
  • High poolOrphanedIds → previous timeouts consumed stream IDs that the driver is still waiting to reclaim

Handler files

Each handler builds a List<NodeDiagnostics> at the moment the timer fires and passes it to the new constructor:

  • CqlRequestHandler.java — iterates inFlightCallbacks (multi-node list)
  • GraphRequestHandler.java — same pattern
  • CqlPrepareHandler.java — single-element list from initialCallback
  • ContinuousRequestHandlerBase.javascheduleGlobalTimeout (multi-node list) + onPageTimeout (single-element list)

The per-handler buildTimeoutMessage() / buildGlobalTimeoutMessage() helpers are removed; message generation now lives in DriverTimeoutException.buildMessage().

Architecture differences from 3.x

In 3.x, the timeout is per-node (SpeculativeExecution.onTimeout()). In 4.x, a single timer covers the entire request, but multiple nodes can be in-flight simultaneously (speculative executions + retries).

3.x 4.x
RequestHandler.SpeculativeExecution.onTimeout() CqlRequestHandler.scheduleTimeout()
HostConnectionPool.pendingBorrowCount ChannelPool.getAvailableIds() (inverse indicator)
HostConnectionPool.totalInFlight ChannelPool.getInFlight()
Connection.inFlight DriverChannel.getInFlight()

Notes

  • No logic changes; diagnostics only.
  • All accessed state is safe from the Netty timer thread: inFlightCallbacks is a CopyOnWriteArrayList, session.getPools() returns a ConcurrentHashMap, and pool/channel stats use atomic operations.
  • Pool lookup is null-checked defensively (host may be removed between dispatch and timeout).
  • All existing test assertions use substring matching and pass unchanged (411 unit tests green).

@nikagra nikagra force-pushed the mhradovich/4.x-driver-540-timeout-diagnostics branch 3 times, most recently from 38daf16 to c2d9d18 Compare May 11, 2026 15:28
@nikagra nikagra requested review from Copilot and dkropachev and removed request for dkropachev May 11, 2026 16:18
@nikagra nikagra requested a review from dkropachev May 11, 2026 16:24
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the debuggability of driver-side request timeouts by enriching DriverTimeoutException with per-node pool/channel diagnostics captured at the moment the timeout fires, and by having request handlers provide those diagnostics when constructing the exception.

Changes:

  • Add DriverTimeoutException.NodeDiagnostics, UNAVAILABLE sentinel, and a new constructor that generates an enriched timeout message and exposes diagnostics programmatically.
  • Update CQL, Graph, and Continuous Paging request handlers to collect per-node diagnostics on timeout and pass them into the new exception constructor.
  • Relax a couple of Simulacron integration tests to accept the enriched timeout message format.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
integration-tests/src/test/java/com/datastax/oss/driver/core/cql/SimpleStatementSimulacronIT.java Adjust assertion to tolerate enriched timeout messages.
integration-tests/src/test/java/com/datastax/oss/driver/core/cql/BoundStatementSimulacronIT.java Adjust assertions to tolerate enriched timeout messages.
core/src/main/java/com/datastax/oss/driver/internal/core/cql/CqlRequestHandler.java Capture node/channel/pool diagnostics at timeout time and pass to DriverTimeoutException.
core/src/main/java/com/datastax/oss/driver/internal/core/cql/CqlPrepareHandler.java Capture node/channel/pool diagnostics for prepare timeouts and pass to DriverTimeoutException.
core/src/main/java/com/datastax/oss/driver/api/core/DriverTimeoutException.java Introduce NodeDiagnostics API, new constructor, diagnostics accessor, and message building.
core/src/main/java/com/datastax/dse/driver/internal/core/graph/GraphRequestHandler.java Capture diagnostics for Graph request timeouts and pass to DriverTimeoutException.
core/src/main/java/com/datastax/dse/driver/internal/core/cql/continuous/ContinuousRequestHandlerBase.java Capture diagnostics for global/page continuous paging timeouts and pass to DriverTimeoutException.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread core/src/main/java/com/datastax/oss/driver/api/core/DriverTimeoutException.java Outdated
@nikagra nikagra force-pushed the mhradovich/4.x-driver-540-timeout-diagnostics branch from c2d9d18 to 526867d Compare May 11, 2026 16:35
Comment thread core/src/main/java/com/datastax/oss/driver/api/core/DriverTimeoutException.java Outdated
…tics (DRIVER-540)

Add NodeDiagnostics public inner class to DriverTimeoutException with fields
for in-flight counts and pool capacity, and generate a diagnostic suffix in
the exception message at timeout time. Refactor all four request-handler
timeout paths (CqlRequestHandler, CqlPrepareHandler, GraphRequestHandler,
ContinuousRequestHandlerBase) to build List<NodeDiagnostics> instead of a
raw message string. Update IT assertions that matched exact message strings
to use hasMessageStartingWith() to accommodate the new suffix.
@nikagra nikagra force-pushed the mhradovich/4.x-driver-540-timeout-diagnostics branch from 526867d to b749f92 Compare May 12, 2026 14:54
@nikagra nikagra marked this pull request as ready for review May 12, 2026 17:41
@dkropachev dkropachev merged commit 441eb42 into scylladb:scylla-4.x May 13, 2026
38 of 40 checks passed
@nikagra nikagra deleted the mhradovich/4.x-driver-540-timeout-diagnostics branch May 13, 2026 12:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants